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Abstract 

An algorithm is discussed for converting a class of recursive processes to a parallel sys- 
tem. It is argued that this algorithm can be superior to certain methods currently found 
in the literature for an important subset of problems. The cases of homogeneous and non- 
homogenous two term recursion are treated. The basic cost factor of the algorithm over 
non-parallel operations is 2 if only the final values of the sequence is needed and 4 if all 
elements are required. In practice these factors can be reduced considerably. Applications 
to three problems (finding the eigenvalues of a tri-diagonal matrix, the solution of a radial 
wave equation and the solution of a tri-diagonal matrix) are discussed. 
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1 Introduction 



The solution of some problems require a greater number of operations for an appropriate 
parallel algorithm than one that would be used in a strictly serial calculation. A gain in 
speed can still be expected by running them on a parallel machine but there is a cost factor 
since one can not expect to achieve an increase in speed equal to the number of processors 
when compared with the best non-parallel algorithm. 
As an example, consider two term iteration. 

x i+ i = diXi + biXi_x i = 1,2, . . . N (1) 

One method suggested in the literature [1] is to replace the steps in the algorithm by 
matrix multiplication. This algorithm requires extra operations which will be discussed 
further at the end of section 2.1 

Another possible algorithm (the one considered here) is based on the fact that there are 
only two independent solutions to Eq. 1. The proper linear combination of them to represent 
the actual solution can be determined by the starting values. In this paper the application 
of such an algorithm for Eq. 1, as well as a similar one for the iteration when there is an 
additional term, q, on the right hand side, is discussed. 

2 General Description of the Algorithm 

2.1 Homogeneous Case 

A recursion relation, such as Eq. 1, can be viewed as one long sequence of values which 
leads from a beginning pair of values to the end. A desirable procedure for a parallel system 
would be to cut up this sequence into separate strips (as many as there are processors) and 
let each processor work through its part independently. For the first processor there is no 
problem since the starting values are known there. But the second processor (and the rest) 
will not have available their starting values (the final values in the previous processor) so 
this procedure doesn't seem possible. With a moderate expense, however, it can be done. 
Since there are only two independent solutions of Eq. 1 we can construct two (arbitrary 
but independent) solutions, which will provide basis functions, and combine them when 
the starting values for each processor are known from the result of the previous one. For 
simplicity, consider the same algorithm running on all processors ignoring the fact that it 
could be computed more efficiently on the first processor. 

Let the total length of the recursion relation be N + 2 (including the first two starting 
values, hence the 2) with M processors. Each processor will be assigned a recursion of length 
L = N/ M which is supposed to be integer and large. The work to be done by each processor 
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will be proportional to M since it never has to calculate the first two values. In order to see 
how such an algorithm works, let us analyze a (very modest) system of 32 recursion steps to 
be calculated with 4 processors. 

Each processor will do the recursion twice, once with starting values and 1 and once 
with values 1 and 0. That is, each processor calculates the two basis function starting with 
the first two values (1,0) and (0,1). It uses the appropriate values of Oj and 6j for its position 
in the global sequence, of course. For the first processor the basis functions start with 

y 10 = 1; y{° = 0; and y 01 = 0; y» l = 1. (2) 

Since any solution of the recursive formula can be written as a linear combination of the 
two basis functions 

x i = ay\° + Py?, (3) 
we can see from the definition of the initial values in the first processor that 

x = a; xi = (3, (4) 

where xq and x± are the starting values for Eq. 1. We could find all of the values of the 
function in the first processor by calculating 

Xi = x yl° + Xiy® 1 First Processor i — 0, 1, . . . , 8, 9 (5) 

but it is better not to do that immediately. Since each value is independent, we may calculate 
only the last two values if we wish. These would be, in this simple case, x s and x 9 . Notice 
that these are the starting values for the second processor. Thus, for the second processor 
since it started with yl° = 1; y\° = and y® 1 = 0; y^ 1 = 1 choosing the proper linear 
combination (a and (3) to give the true values of x$ and xg (known from the first processor) 
again we could calculate all of the values 

Xi = x s yl° + Xgy® 1 Second Processor % — 8, 9, . . . , 16, 17. (6) 

Again, we need calculate only the last two (x K and xn) to get the starting values for the 
third processor. From these we obtain the last values in the third processor X24 and £25 and 
the fourth processor X32 and £33. Thus, we have found the last two values of the sequence 
with the evaluation (after the parallel computations) of 8 equations. For 4 processors there 
will always be 8 equations regardless of the length of iteration, L, within each processor. 

Table 1 gives the operations explicitly for this small example. In this case each processor 
has only 8 iterations to do. In a more practical example numbers more like 10 6 might be 
expected. The table lists the initial conditions at the top followed by the iterations. The 
generic variable, y, indicates both y m and y 10 are to be calculated. 
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Processor 


Processor 1 


Processor 2 


Processor 3 


l/ 01 = 0; yf = 1 

„,io i . „,io n 

Vo — L , Vi — u 


yl 1 = 0; yf = 1 
„,io i . „,io n 


y§ = 0; y° 17 = 1 
„,io i . „,io n 


2/ 2 u ] = 0; y% = l 

„,io i . „,io n 

1/24 - 1/25 - U 


V2 = a,\V\ + hy 
1/3 = a-iVi + hyi 

Vi = aaVe + b e y 5 
Vs = a 7 y 7 + b 7 y e 
y 9 = a 8 y 8 + b 8 y 7 


1/10 = a 9 t/9 + &91/8 

yn = a-wyio + &101/9 

1/15 = 0142/14 + &141/13 
1/16 = 0152/15 + &152/14 

Vn = a ie yi 6 + b 16 y 15 


Vis = anVn + bnyw 
Vw = oisZ/18 + b 18 yn 

1/23 = 2 2|/22 + &221/21 
1/24 = 023^/23 + &23I/22 
1/25 = 2 4l/24 + &241/23 


1/26 = 2 5l/25 + &251/24 
1/27 = ^261/26 + &261/25 

1/31 = O 3 0l/30 + &301/29 
1/32 = 0311/31 + &311/30 
1/33 = 3 2l/32 + &321/31 



Table 1: Example of the algorithm for a small number of processors. 



After this work has been done the following sequential steps need to be taken using only 
the last two values taken from each processor. 

x 8 = x yl° + x x yf] x 9 = x yl° + x^ 1 
xie = x 8 y[% + x 9 y1l; x 17 = x 8 y\^ + x 9 y§ 

X 2 4 = £161/24 + ^17^24 5 x 25 = X W y^ + X 17 yl\ 
^32 = X2±yll + X25I/32; ^33 = ^241/33 + X 25l/33 

The absolute indices have been used above on the basis functions. It is often more 
convenient to use a combination of the local index and the processor number. The local 
index will be denoted by A and runs from to L + 1. 

In general, only two values in any processor need be computed to find the starting values 
and hence the coefficients of the two basis functions for the next processor. So the serial 
overhead is only twice 3 floating point operations per processor. Even if all of the values 
of the sequence are needed, it is better to do this operation first, because the intermediate 
values can then be found in parallel using the starting values obtained in this way. 

Repeating the above argument for the general case with M processors labeled \i = 
0, 1, 2, . . . , M — 1 and N + 2 total values of the indices of x i: the resulting sequences are 
calculated in a (long) parallel calculation, 

»yl° and »yf- A = 2, 3, . . . , L + 1; [// = 0, 1, . . . , M - 1]. (7) 
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n N = 2 n Sequential m = 2 m = 3 m = 4 m = 5 m = 6 m = 7 m = 8 
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128 


384 





.5625 





.3750 


0.3750 


0, 


.5625 


1 


.0312 


1 


.0000 


1 


.0000 


8 


256 


768 





.5312 


0. 


.3125 


0.2500 


0, 


.3125 





.5312 


1 


.0156 


1 


.0000 


9 


512 


1536 





.5156 





,2812 


0.1875 





,1875 





.2812 


0, 


,5156 


1 


.0078 


10 


1024 


3072 


0. 


.5078 





,2656 


0.1562 





,1250 





.1562 





,2656 


0, 


.5078 


11 


2048 


6144 


0. 


.5039 





,2578 


0.1406 





,0938 





.0938 





,1406 





.2578 


12 


4096 


12288 


0. 


.5020 





,2539 


0.1328 





,0781 





.0625 


0, 


,0781 





.1328 


13 


8192 


24576 


0. 


.5010 





,2520 


0.1289 





,0703 





,0469 





,0469 





.0703 


14 


16384 


49152 





,5005 





.2510 


0.1270 





,0664 





.0391 





.0312 





.0391 


15 


32768 


98304 





.5002 


0. 


.2505 


0.1260 


0, 


,0645 





.0352 





.0234 





.0234 


16 


65536 


196608 





.5001 





.2502 


0.1255 





.0635 





.0332 





.0195 





.0156 


17 


131072 


393216 


0. 


.5001 


0. 


,2501 


0.1252 


0. 


0630 


0, 


0322 


0, 


.0176 


0, 


.0117 


18 


262144 


786432 


0. 


.5000 


0. 


2501 


0.1251 





,0627 





.0317 


0, 


.0166 





.0098 


19 


524288 


1572864 


0. 


5000 


0. 


.2500 


0.1251 





,0626 


0, 


.0315 


0, 


.0161 


0, 


.0088 


20 


1048576 


3145728 


0. 


.5000 


0. 


,2500 


0.1250 


0. 


0626 


0, 


,0314 


0. 


0159 


0, 


.0083 



Table 2: Strip iteration algorithm. Column 3 gives the time to completion for straight 
iteration in one processor in units of the time required for one floating point operation. 
Columns 4 through 10 give the ratio of the time to completion to the corresponding time for 
a purely sequential realization on a single processor (column 3). The number of iterations is 
N = 2 n with a number of processors M = 2 rn . This "power of two" representation is only 
for simplicity and is not needed for the algorithm. 



The square brackets indicate that the calculations for the different values of \i are done in 
parallel. After this step, the equations 



X( P+ i)L = x^ L Vyl° + x^ L+1 ^yf 
x^+i)L+i = x^ L "y^ +1 + x^ L+1 »yf +l 



(8) 

[j,=0,l,. ..,M-l 

are evaluated in a (short) sequential calculation. The number of equations is always twice 
the number of processors. The values of x have been written with absolute indices but we 
may use a notation for individual processors. The pre-superscript fi as used above denotes 
results from a given processor /i so we could write equally valid representations of x as 

x^l+x =^x x . (9) 
In matrix notation we may write Eqs. 8 as 

/u=0,l,...,M-l 
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Table 2 shows the ratios of completion times to the purely sequential case to be expected 
with various recursion lengths and number of processors. The increase in speed over scalar 
is M/2 for large N. One sees that for moderate numbers of iterations the scaling efficiency 
starts to fall of for a number of processors beyond 64. Thus, for most practical problems, the 
algorithm is expected to work best for relatively large numbers of iterations and a modest 
number of processors. This loss comes, of course, because of the need to compute the 
matching relations which requires a time proportional to the number of processors. The 
algorithm could be modified to spread this matching procedure over a number of processors 
but this extention is beyond the scope of the present work. 

If only the end value of the sequence is needed one can stop at this point (the "short 
form" of the algorithm). This factor of two cost is not the best that can be obtained if the 
values of and bi are being calculated along with the iteration. The same values of these 
coefficients are used in each iteration and, if the time for the calculation of the coefficients 
is significant, the overhead to calculate two iterations rather than one (as would happen if 
the calculation were not in parallel) may be small. 

The latency part of the communication time is proportional to L = M/N, so, for a fixed, 
moderate, number of processors and large iV it may be made very small. There are four 
words per processor to be sent in order to make the connection between segments. 

At this point (if needed) one can proceed to calculate the entire sequence of values in a 
second parallel calculation (the "long form" of the algorithm). These will be given by 



x x = x liL+x = x»l "yf + x^ L+1 "yf; X = 2, 3, . . . , L + 1; \pi = 0, 1, . . . , M - 1] (11) 



These evaluations come at the cost of an additional 3L floating point operations per 
processor, roughly a cost factor of 4 i.e., the speed increase is M/4 compared to the pure 
sequential algorithm. This cost can be reduced greatly in certain cases as we shall see later. 
It may be useful to leave the strip functions (or even the strip basis functions) in the processor 
where they were calculated. 

We can now compare with the matrix algorithm mentioned in the introduction. If we 
define 





then 




(13) 



and the end member of the sequence is given by 




(14) 
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The multiplication of matrices can be done pairwise on different processors. The first 
multiplication of the simple matrices can be done with two multiplications and one addition 
but this operation generates full two by two matrices so that the second step in the pairwise 
reduction of the multi-factor product is a complete matrix multiplication and requires 8 
multiplications and 4 additions. When compared with the 2 multiplications and 1 addition 
necessary to actually do the iteration, one sees that there is a cost of a factor of 4 which 
is paid in this case. Assuming 12 operations for all steps the maximum speed up with M 
processors is M/4. The matrix algorithm gives only the end point of the sequence (i.e. not 
the intermediate values) so is to be compared with M/2 from the algorithm just presented. 
With a modest number of processors, over half of the time of the matrix algorithm is spent 
in the first set of multiplications so that considerable savings can be achieved by considering 
the special case for the first operation. 

This cost factor is not the only problem. The work done in each matrix multiplication 
is not very much (12 floating point operations). Hence, communication must take place 
between processors very often so that message passing time may dominate. 

Another possible problem is that often the entire sequence of Xi is needed. This algorithm 
simply doesn't give it. 



2.2 Inhomogeneous Recursion Relation 

For the inhomogeneous recursion 

x i+ i = a, t Xi + biXi_i + Ci (15) 

3 basis solutions are needed to provide a general representation. For the third basis function 
we can take z®°, defined to have the starting conditions z®° L = z®° L+1 = 0. The general form 
of the solution is thus 

Xl = azl° + (3 zf + 1Z f (16) 

where z}°, z® 1 , and zf 3 all separately satisfy Eq. 15. The requirement that this form satisfies 
the recursion relation is a + [3 + j = 1. 

Using this last expression to replace 7 we can write 

x t = a(zl° - a») + - z?) + *» (17) 

It is easy to show that 

y? = and y?i = *oi-*oo (18) 

satisfy the homogeneous equation (i.e. with q = 0) with the same starting points as the z}° 
and z® 1 , exactly as in the homogeneous case treated before. Thus, the general form can be 
written as 

Xi = ayl° + (3yf + (19) 
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where y}° and y® 1 can be calculated as in the previous section, i.e. without reference to q. 

The procedure in this case is first to calculate in parallel (either along with the ^yl or 
separately) the (long) recursion 



^ u +1 = cifiL+x %° + W "Zx-i + c m l+a; A = 1, 2, 3, . . . , L; [fi = 0, 1, . . . , M - 1]. (20) 

Since both starting values of are zero, the values of the coefficients of yf and y® 1 are 
again just the last values from the previous processor so that the equations 



(21) 



/i=0,l,...,M-2 



or in processor notation, 



(v+^xo =fl %o +^xi »yf 

(M+l) Xl =» Xo M 2/ 10 +i M y 01 +i + ^00 +i 



(22) 



H=0,l,...,M-2 



need to be evaluated in a (short) sequential calculation. 

If needed, the intermediate values in the recursive sequence can now be evaluated in 
parallel. These will be given by 



x. 



»L + x=x liL ^ + x^ +l »yf+»zf- A = 2,3,...,L + 1; [// = 0, 1, . . . , M - 1] (23) 



or 



»x x "y}° "yf +^°°; A = 2, 3, . . . , L + 1; [// = 0, 1, . . . , M - 1] (24) 



If the recursion relation is needed for a large number of functions, q, with the same a» 
and q, then the basis functions y w and y 01 need be calculated only once. 



3 Applications 

3.1 Eigenvalues of a tri-diagonal matrix 

The process of finding eigenvalues of a real tri-diagonal matrix plays a central role in the 
solution of the eigenvalue problem of more general real symmetric matrices. Commonly, 
algorithms are converted to a parallel environment either by having each processor search 
for an eigenvalue (method A) [2] or finding all of the eigenvalues by means of divide and 
conquer algorithms (method B) [3]. It is often useful be able to pick out only a few of the 
eigenvalues (the lowest ones) which is the case considered here. 
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For a symmetric tri-diagonal matrix 
/ 







b 2 





b 2 
a 3 






b 3 



V 



b N - 





O-N- 

b N - 




b N -i 
a N 
b N 



(25) 





b N 

ClN+l j 

(see, for example, Refs. [4, 2]) 



there exists a well-known solution for the eigenvalues, A, 
based on Sturm sequences with bisection which allows the selection of eigenvalues. With the 
definition 

x — 1; X\ = di — A (26) 

the recursion relation 

x i+1 = (a i+1 - A)xi - bjxi-i (27) 

generates the determinant, -D(A), of A — AI as the value of xn+i- The eigenvalues of A 
can be found by locating the zeros of D(A). Furthermore, the number of sign differences 
between successive members in the recursion sequence identifies the eigenvalues in order. 
For example, the lowest eigenvalue occurs at the transition from to 1 sign differences in 
the sequence. The desired transition (and hence eigenvalue) can be found with Newton's 
method of bisecting some A m j„ and A max at each step. 

Implementing this recursion in a parallel fashion is straightforward using the algorithm 
given in Section 2.1. In order to calculate the number of sign differences, the individual 
members of the sequence need to be generated which requires the long form of the algorithm, 
thus seeming to cost a factor of 4 compared to a non-parallelized version. However, a hybrid 
method makes the cost factor closer to 2 than 4. When the difference in sign count has been 
reduced to unity between A min and A max it is known that a single eigenvalue lies in this 
region and that it is the correct one. From this point on, the method needs only the final 
value of the sequence (the determinant itself) which requires only half the time. 

Hence, the algorithm can be thought of as proceeding in two phases. In the first phase 
the desired eigenvalue is isolated by finding two values of the estimated eigenvalue with only 
a single zero of the determinant between them. After that, in the second phase, only the 
value of the determinant is needed and with those values one can estimate a new trial value 
more efficiently than a simple bisection by using 



A = 



A m in | D ( A max ) 


A max | D ( A m i n ) | 




1 + \D(A min )\ 



(28) 



One must be careful of the convergence since it may come about with the trial eigenvalue 
approaching one of the limit eigenvalues without the two limits approaching each other. This 
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improvement in efficiency is available to either the parallel or non-parallel version and tends 
to make the first phase dominant in time consumed. 

A common method of implementing this general algorithm on parallel computers is to 
simply give each processor an eigenvalue to find (method A above). In this case each eigen- 
value is obtained in a purely sequential fashion but the values arrive in a parallel manner. 
In the present algorithm each eigenvalue is calculated in a parallel manner and the values 
arrive one after the other. One improvement which is not generally available to method A 
is due to the availability of useful information after the first and subsequent eigenvalues are 
found in the first phase. It is only necessary to keep a table of the tested values of A vs. 
the corresponding number of sign differences. When the next eigenvalue is to be found, the 
table can be searched for the closest starting values. This table grows as the eigenvalues 
are found. If each processor is finding an eigenvalue starting from the outer bounds of the 
eigenvalue sequence this advantage is not available. 

An important consideration in this algorithm (parallel or non-parallel) is the growing of 
the values with each step so that overflow occurs. One can solve that problem by performing 
a renormalization at regular intervals. In the non-parallel version, when the value of Xj+i 
is observed to exceed some predefined value then both x i+ i and Xi are multiplied by an 
appropriate constant reducing factor. This does not change the number of relative sign 
differences nor the sign of the last value. For the parallel version both basis functions can 
be renormalized in the same way (both must be done at the same time) and the recursion is 
not destroyed since it is only the relative values of the basis functions which determines the 
number of sign differences and the sign of the final value. In the version tested here a check 
for renormalization was made every 16 steps. 

The algorithm was calculated for the matrix defined by 

(H = 0, i = 1,2,...N + 1; b] = i(N + l-i), i = l,2,...N (29) 

with N even. The eigenvalues are known to be the even integers from -N to N as presented 
in Refs. [2, 5]. In addition to the renormalization mentioned above, the entire system 
was renormalized such that the smallest bf was unity. Calculations were made for matrices 
of size up to A = 10,240,000 finding eigenvalues to an accuracy of 1 part in 10 11 . The 
method was tested on a Beowulf cluster with a 100 Mbit Ethernet (using MPICH[6, 7]) 
and the scaling efficiency [defined as Ti/(MT M ) where T M is the time for execution on M 
processors] exceeded 0.96 up through 4 processors. 

The "cost" of the method was calculated by comparing one-processor versions of the 
present algorithm with a simple sequential calculation but with the present algorithm taking 
advantage of the information contained in the table of the number of sign differences vs. trial 
A. For a single eigenvalue the present method takes about 2.5 times longer than method A. 
For 5 eigenvalues it still takes about 20% longer. For 10 eigenvalues it is 0.85 as long, for 20 
it is 0.72 as long and for 40 it requires 0.64 of the time for method A. 
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Procedure A also may suffer from an incommensurability with the number of processors. 
If one wishes 8 eigenvalues with 64 processors then only 1/8 of the capacity of the machine is 
being used. The calculation of the relative efficiency of the methods can be quite complicated, 
however. For example, if one wishes 16 eigenvalues with 8 processors then two passes will be 
made with method A and in the second pass tables of values accumulated in the first pass 
can be used. Incorporating this information could well make method A faster. 



3.2 Solving wave equations 

Wave equations (the Schrodinger equation is considered here) can be solved, after an expan- 
sion in Legendre polynomials, by means of one-dimensional second order differential equa- 
tions. An accurate solution can be obtained with Noumerov's method where the iteration 
equations for the reduced wave function are given by 

2i/ji - ipi-i - fj [Wwiipi + Wi-iipi-x] 
1 + ^w i+1 

where 



w(r) — k— -jVr - ( 31 ) 



and h is the spacing in the radial variable r. 

This form is converted readily into that considered in Section 2.1 with much of the 
work of the computation of a* and bi being done before the iteration. Keeping k 2 as a free 
parameter and precomputing one vector {1 — h 2 [j^V{r) + ^-^]} the calculation of a* and 
hi requires seven floating point operations. Given that each iteration requires an additional 
three operations, we might expect that the parallel algorithm involves a cost of an increase 
from ten to thirteen operations or about 30% over the purely sequential one. 

In the present case the Schrodinger equation is solved for a bound state. The solution is 
started at the origin with zero for the first point and an arbitrary value for the second point 
and a search is made for a value of the energy (expressed here as k 2 = 2mE/h 2 ) that causes 
the wave function at some large value of r to be zero. 

The problem was treated with a near maximum precomputation of the values of a« and 
&j. Clearly, if one chooses not to compute as much as possible in advance, and hence to 
spend a larger amount of time in the computation of the coefficients, then the addition of 
a second iteration (the expense of this algorithm) would make less difference. Thus, a true 
test of the algorithm relies on a realistic degree of precomputation. 

The calculation was coded and tested for £ = 2. The number of steps taken was 25.2 x 10 6 . 
Because of the boundary conditions of the problem (reduced wave function zero at the origin) 
the calculation of the basis function y 10 is not needed in the first processor. By comparing 
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a one-processor calculation with it computed or not it was found that the cost factor of 
the algorithm was 1.31, in agreement with the 30% increase estimated above. The method 
was tested on the Beowulf cluster and no decrease in scaling efficiency was seen through 16 
processors. 

3.3 Tri-diagonal Matrix Solution 

Consider the recursive solution to a tri-diagonal matrix (size N + 2), at first without any 
parallelism. 



/ a e 
b\ ai e\ 
b 2 a 2 e 2 



V 



x 2 



b N -i 





a N-l 

b N 




ejv-i 
a N 

bN+i 





e N 
a N+ i / 



x N 



( c \ 

Cl 

c 2 



CJV-I 
C N 
V Cjv+i ) 



(32) 



If any (let the first occurrence be at % = k) is zero then the system can be reduced to 
two subsystems. To see this observe that the first equation alone consists of one equation 
in two unknowns, the first two equations correspond to two equations in three unknowns, 
etc. If e k = then adding that equation to the system introduces no new unknown so the 
system of the first k + 1 equations can be solved alone giving the value (among others) of %k- 
In the remaining equations the k th column can be taken to the right hand side so that they 
can be solved. In this case the system is separable. For a symmetric system, if e k = then 
b k+ i = also and the two blocks are completely decoupled. Here it is assumed that this is 
NOT the case so that NO = 0. Thus, we can divide all equations by e,- L or, equivalently, 
we can set ei = 1 in the system we wish to consider. Of course, the conversion of a general 
system to this form entails the cost of one inverse and two multiplications per equation on 
the left hand but needs to be done only once in the case of a number of different right hand 
sides (N + 1 more multiplications are necessary for each right hand side). 

For these reasons, the following system is considered. 
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/ a 1 
61 ai 1 
b 2 a 2 1 



V 



X\ 
%2 



b N -i 









1 

Oat 




1 



XjV_l 

x N 



Cl 

c 2 



CJV-I 
Cat 
V Ctv+i / 



(33) 



Starting from the second row the equations can be expressed as the recursion relation 



- hxi_i + q; i = 1, 2, 3, . . . , N - 1, AT, 



(34) 



where neither the first or last equations have been used. 

The three basis solutions discussed in Section 2.2 (called here fl°, ff 1 and g®°) can be used 
to provide "global" basis functions (global in the sense that they represent the full recursion 
sequence to be distinguished from the strip functions to be discussed shortly) to express the 
solution. The first two basis solutions do not involve q and need only be calculated once for 
many right hand sides. Thus, the solution separates into two parts, somewhat similar to the 
common factorization and back substitution methods. Once we have the basis solutions we 
can apply the conditions implied by the first and last equation to determine the coefficients 
a and j3 in Eq. 19. For the first equation we have 



a x + x 1 = a (a/ 10 + /3/ 01 + g° °) + afl° + Pf? 1 + gf = c (35) 

or 

a^a + (3 = Co. (36) 

From the last equation we have 

a N+1 x N+1 + b N+1 x N = a N+1 (af^ +1 + (5f^ +1 + g™ +1 ) + b N+1 (af™ + (3 f„ + g%) = c N+1 (37) 
or 

®(aN+ifN+i + bN+ifpj) + P( a N+ifN+i + bN+ifw) — °n+i — Ojv+i<?jv+i — b N+ ig°N (38) 

From these two equations we obtain a and (5 and all values of Xt can be obtained from 
Eq. 19. 

As an alternative to Eq. 38 one can iterate one further step with Eq. 34 to obtain 
In+2i fiv+2 ano - 9^n+2 an d use the condition that x^ +2 = to find 
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af 1 N \2 + PfN + 2 = -gN + 2- (39) 

Returning to the parallel considerations, we can express the global basis functions in 
terms of the strip basis solutions in each processor, obtain the three global functions that 
were used in the above algorithm and then calculate the solution. However, it is much more 
efficient to combine the two operations. 

First write the global recursion basis functions in terms of the strip basis functions. 

/S+a = ^° "Vx +r° ^ (40) 

ff L+ x = ^ V +r x »yf (4i) 
= ^ 00 V +^ 00 "yf +^a° (42) 

where the and ^(3 are to be obtained from the matching conditions for fj, = 1,2, M — 1. 

»a 10 = ^a 10 ^y™ +»- 1 p 10 ^yf (43) 
»(3 W = ^a w ^Vl+i +^ 1 /3 10 ^yf +1 (44) 

^a 01 = ^V 01 ^Vl +'" 1 P 01 "'Vl (45) 
^ = M-i a oi „-i y io +i ^-i^oi M -i y oi +i (46) 

M a oo = M-i a oo M -i y io ^oo ^-i y oi ^oo (47) 

^oo = M-i a oo M -i y io +i ^-i^oo ,-i y oi +i + ^ z oo +i (48) 

with the starting values 

V° = 1; V 1 = 0; °p w = 0; °p 01 = 1; V° = °/3 00 = 0. (49) 

Using the last two values of the global basis functions calculated from Eq. 40-42 we can 
solve for the global a and f3 (from Eq. 36 and 38) to write 

x^l+x = M « M yf +»P "yf +^° x °; A = 2, 3, . . . , L + 1 \n = 0, 1, . . . , M - 1] (50) 
where the coefficients are given by 

"a = a li a 10 + <*a 01 +»a m (51) 

»P = a ^ 10 + ^ 01 +^ 00 (52) 

It is common to compare the relative speed of any algorithm for solving tri-diagonal 
matrices with Gaussian Elimination (GE) which is relatively efficient. For this case GE 
becomes, first for the LU reduction 

d = l/a 
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gi = bidi-i, di = l/(di - gi) i = 1, 2, . . . , N + 1, (53) 
followed by the two back substitutions 

4 = c o; 4 = Ci- g { zl_ x \ i — 1,2, . . . , N + 1, (54) 

and 

x 9 N+l = z 9 N+1 ; xf = (zf - x? +1 )di; i = N, N - 1, . . . , 0. (55) 

If we assume that the equations are being solved for many right hand sides, then we 
should compare the time for the solutions of the z 00 equations and the calculation of the 
vector (Eqs. 20 and 50) with the work of the two back substitutions in GE (Eqs. 54 and 
55). A first estimation can be made for the relative speed by counting the number of floating 
point operations per step (4 for GE and 8 for the parallel algorithm) to get a cost factor of 2. 
This is only a very crude estimate since the form of the equations is different. For example, 
Eq. 50 requires only the broadcast of a scalar instead of vector multiplication. Optimization 
or not of the G77 compiler was observed to make a large difference also. With no optimizing 
GE does better than this estimate giving a cost factor of 2.6. However, with optimization, 
the efficiency of the parallel method is improved more than GE to result in a cost factor of 
1.4. 

To save on message passing for the resultant vector, one may want to use the strips in 
the processor in which they were formed. In some cases it may be more efficient never to 
construct the vectors at all. As an example of such a case, consider the problem of solving the 
set of equations for a large number of different right hand side vectors which are a function 
of some parameter, 77, hence, ^(77). Suppose also that we wish the sum (an integral perhaps) 
of some weighting function over the solution 

N+l 

s (v)= w i x i(v) (56) 

i=0 

as a function of 77. The sum can be distributed among the strip basis functions in the 
processors via Eq. 50. The y basis function integrals need be calculated only once. The 
z 00 integral can be calculated as this basis function is generated. Only the strip integrals 
need to be sent to the master processor to be combined with the coefficients and ^(3. The 
calculation of the solution (Eq. 50) is not needed. In this special case, a count of the number 
of floating point operations estimates the speed to be the same as GE (in the limit of large 
N and large number of values of 77). In one-processor tests, because of the simplicity of the 
equations mentioned above, the strip algorithm was found to run somewhat faster than GE. 

Large systems (720, 720 x k with k = 10, 20, 40, 80, and 160) with 100 values of 77 were 
tested on the 16 processor Beowulf cluster. Essentially no degradation of performance was 
seen with all scaling efficiencies > 0.99. The largest system tested (N = 115, 315, 200) could 
only be run by spreading the solution basis vectors over 13 processors. 
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A common algorithm discussed in the literature is the parallel cyclic reduction of a 
matrix[8]. The basic cost of this algorithm has been reported to be a factor of 4 [9, 10]. It 
requires frequent exchange of information and is not very efficient for multiple right hand 
sides. The "divide and conquer" method [8] is also inefficient for multiple driving terms. 
Hence, the technique presented here would seem to offer an attractive alternative to these 
methods in some cases. 

The restriction to ^ may prove to be inconvenient in some cases or the division may 
lead to large errors. Tests with e, = 1 showed that the stability of the method was as good 
or better than GE. 



4 Discussion 

These algorithms may also be useful on vector machines. For a processor with a 64 word 
vector register, for the case of the homogeneous recursion relation, the total length can be 
broken into 32 strips with each pair of words in the vector register acting as a processor. 
Thus, the iteration might take place as 
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(57) 



for i = 1, 2, . . . L. 



The method can be generalized for a larger number of terms in the iteration (leading to 
larger width in banded matrices, for example). 

There are clearly some limitations to the application of the algorithm. The conversion to 
a parallel system does not work for recursions non linear in Xi so most classical mechanics 
calculations are not possible with it. 

While the problems treated are different, this method appears to have some overlap with 
the Domain Decomposition techniques for the solution of elliptic differential equations [11]. 
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